1  Introduction

Semantic space models, such as Latent Semantic Analysis (LSA), the Hyperspace
Analogue to Language (HAL) and Random Indexing (RI), offer convenient methods for
representing semantic relations between words and concepts, abstracted from a
distribution of documents. The distribution of documents determines the local
co-occurrence patterns between words across the corpus and therefore determines
the semantics abstracted from them. Such methods are sensitive to the statistical
properties of the distribution of words over documents. For instance, the
semantics of the word table abstracted from a scientific corpus may differ from
the semantics abstracted from a general corpus. In the first case, since table may
occur in the context of table of correlation or table of results, it would be
associated with the word correlation, whereas in the second case, because it may
co-occur with kitchen or living-room, it would rather be considered similar to
chair. Nevertheless, the formal relation between the properties of the
distribution of word co-occurrences and the final semantics produced by semantic
space methods has not been described until now. In a mixed "scientific and
general" corpus, what makes the semantics of table become more similar to chair
than to Spearman, or vice versa?

We approached the Top-stories task of the Blog Track'09 using a system named
Blogosphere Random Analysis using Texts (BRAT), composed of two layers. The first
layer distributes and represents blog posts in different semantic spaces built
using Random Indexing. The second layer is a retrieval algorithm that navigates
the semantic space via a random walk. BRAT was built on two main working
hypotheses that we consider important for dealing with the semantics of the
blogosphere: the notion of semantic identity and the notion of semantic pollution.

The article is organized as follows. In the first part, we briefly review the
methods and properties of semantic space models. The notions of semantic identity
and semantic pollution are then described in general terms, together with their
practical implications for the Top-stories task. In the second part, the BRAT
system is described. The third part gives an overview of the performance of BRAT
on the Top-stories task.

2  The  cognition  of  Blog  Mining 

2.1  Semantic  Spaces 

Word Vectors are a family of models that represent the semantic similarity between
words as a function of the textual environment in which those words appear. The
distribution of word co-occurrences is collected, analyzed and transformed into a
semantic space, in which words or concepts are represented as vectors in a
high-dimensional vector space. LSA [Landauer and Dumais, 1997], the Hyperspace
Analogue to Language [Lund and Burgess, 1996] and Random Indexing [Kanerva et al.,
2000] are examples of Word Vector models. These models are based on Harris's
distributional hypothesis [Harris, 1968], which states that words that appear in
similar contexts have similar meanings. The definition of the unit of context is
an issue common to all of these models, even though its nature differs from one
model to another. For example, LSA builds a word-document matrix, in which each
cell holds the frequency of a specific word i in a specific unit of context j. HAL
defines a sliding window of n words that moves over each word of the corpus, and
then builds a word-word matrix, in which each cell c_ij contains the number of
times a word i co-occurs with a word j within the window. Different mathematical
and statistical methods are then applied to the distribution of frequencies stored
in the word-document or word-word matrix in order to abstract the meaning of
concepts. The first purpose of this processing is to abstract the central tendency
of the frequency variations and to eliminate what can be considered "noise",
caused by the idiosyncratic use of language of each person or author. LSA uses a
general method for the linear decomposition of a matrix into independent principal
components, the Singular Value Decomposition (SVD). HAL reduces the computational
cost by retaining a small number of principal components of the co-occurrence
matrix. Vector representations are used to store and manipulate the meaning of
concepts. At the end of the process, the similarity between two words may be
calculated using different methods. A classical method is to compute the cosine of
the angle between the two vectors corresponding to the words or groups of words in
order to approximate their semantic similarity. Another method is the weighted
Euclidean distance.
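As an illustration of the pipeline described above, the following minimal sketch
builds word-word co-occurrence vectors with a sliding window and compares two
words with the cosine. It is a simplification and not the HAL implementation: the
window is symmetric and unweighted, no dimensionality reduction is applied, and
the function names and the toy corpus are hypothetical.

```python
import math
from collections import defaultdict


def cooccurrence_vectors(tokens, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): "correlation" and "results" share the
# neighbours "table" and "of", so their vectors point in similar directions.
corpus = "the table of correlation shows the table of results".split()
vectors = cooccurrence_vectors(corpus, window=2)
similarity = cosine(vectors["correlation"], vectors["results"])
```

Storing the vectors as sparse dictionaries keeps the sketch independent of any
numerical library; a real system would typically use a matrix representation.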

In sum, the inputs of Word Vector models are a distribution of textual episodes
defined as units of context. The distribution of word co-occurrences is matched
with the distribution of textual episodes in which the words appear. The outputs
of Word Vector models are concepts that emerge from this matching of
distributions.

2.2  Semantic  identity  and  pollution 

As introduced above, within the framework of semantic space methods, the semantics
produced for a given word depends on the distribution of the other words that
co-occur with it. As a consequence, the semantics of a word is never given ex
nihilo, i.e. it does not pre-exist without (i) a learning process carried out on
(ii) a distribution of contexts or episodes (i.e. units of experience). The final
semantics associated with a word has an identity that has been forged during the
learning process, which is carried out by SVD for LSA or by accumulation for RI.
The semantic identity of a given word such as table changes as a function of the
corpus in which it appears.

The notion of semantic identity applies not only at the scale of words but also at
the scale of the semantic space itself. A semantic space has a particular
identity, given by the distribution of word co-occurrences from which the space is
built. The notion of semantic identity is circular because it reflects the
circularity of the distributional hypothesis, on which semantic spaces are based.

The notion of semantic identity brings nothing very new to researchers familiar
with semantic spaces, and it might appear somewhat trivial if it did not highlight
a second notion, which we call semantic pollution. In the previous example of a
mixed "scientific and general" semantic space, the semantic identity of table is
forged as much by the semantics related to science as by the semantics related to
everyday life. In a general semantic space, if a word is similar to table, one can
reasonably assume that this word is not far from kitchen or house. In a mixed
"scientific and general" space, this assumption becomes less reasonable, because
the semantics of table has been polluted by the scientific part of the corpus. One
can argue that this semantic pollution is nothing more than polysemy. This is true
for the word table, because it is a polysemous word, but the pollution of the
identity of table in turn pollutes the identity of the words it co-occurs with,
such as correlation, Spearman, kitchen, house, etc. These words are not
polysemous, but their semantic identity is polluted too. Because of table, a word
such as correlation may end up not far from living-room in terms of semantic
similarity. Once again, semantic pollution applies at the scale of words but also
at the scale of the space, for the same reason of circularity described above.
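The effect can be reproduced on a toy example. The sketch below is only an
illustration under simplifying assumptions: the three small corpora are invented,
each context is a bag of words, and word vectors are raw co-occurrence counts
without any SVD or random projection. In the general corpus alone, correlation has
no similarity with kitchen; in the mixed corpus, the two words become similar only
because both co-occur with table, and table itself drifts away from chair.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def word_vectors(contexts):
    """Word-word co-occurrence counts, each context being one co-occurrence unit."""
    vectors = defaultdict(Counter)
    for context in contexts:
        for word, neighbour in permutations(context, 2):
            vectors[word][neighbour] += 1
    return vectors


def cosine(u, v):
    """Cosine between two sparse count vectors; 0.0 if either one is empty."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpora.
scientific = [["table", "correlation", "spearman"],
              ["table", "results", "correlation"]]
general = [["table", "kitchen", "chair"],
           ["table", "living-room", "chair"]]
mixed = scientific + general

general_space = word_vectors(general)
mixed_space = word_vectors(mixed)

# "correlation" is unknown in the general space, hence similarity 0.0.
clean = cosine(general_space["correlation"], general_space["kitchen"])
# In the mixed space the shared neighbour "table" makes them similar.
polluted = cosine(mixed_space["correlation"], mixed_space["kitchen"])

table_chair_general = cosine(general_space["table"], general_space["chair"])
table_chair_mixed = cosine(mixed_space["table"], mixed_space["chair"])
```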


3  Application  to  Blogs  Mining 

The notions of semantic identity and semantic pollution are the two main ideas
underlying our approach to the analysis of the blogosphere. In our view, the
blogosphere is a cognitive system that produces textual information expressing
people's views and ideas about the views and ideas of others. For the Top-stories
task of the Blog Track of TREC'09, the goals were (i) to detect the headlines of
the New York Times that had generated exchanges in the blogosphere and (ii) for
each of these headlines, to propose some related blogs.

Considering the notion of semantic identity, we assume that the events in the news
at a given time produce specific exchanges that differ from the exchanges produced
by the events in the news at another time. It is therefore advantageous to split
the exchanges into time periods in order to extract the semantic identity
associated with the news that produced them.

Within the framework of semantic space models, the textual exchanges produced
during a specific period of time construct a semantic identity that is related to
the news of that period. Hence, in the first part of the process, semantic spaces
are built from posts and comments written in a given period of time.

Nevertheless, even when choosing documents produced during the same period of
time, a large part of the selected exchanges is not related to the headlines of
the New York Times. These unrelated texts take part in the construction of the
semantic identity of each semantic space, but they also pollute its semantics in
the manner described above. The retrieval algorithm tries to navigate a semantic
space while taking into account the degree of semantic pollution in the space.


4  BRAT 

The basic idea behind our work is that if we provide an efficient and easy way to
navigate a semantic space containing both blog posts and headlines, then we can
retrieve the relevant blog posts for each headline by walking randomly in the
semantic space. However, we have to cope with the semantic pollution of the space.

The principle underlying the algorithm is to consider a representation of the
"semantic identity of the day" as the sum of all document vectors corresponding to
the given day. Taking into account that this representation might be strongly
affected by a large number of documents that are irrelevant from the perspective
of the top stories of the day, we defined a procedure that computes each
document's similarity with both the "semantic identity of the day" and each of the
headlines. In addition, for each headline, we rank a number of posts using a
random walk through the semantic space. The procedure stops when a set of
conditions, detailed below, is satisfied.

In practice, for each topic and after a pre-processing phase, Random Indexing
[Sahlgren, 2006] was used to build a semantic space containing the blog posts as
well as the headlines in a window around the date of the topic. This geometric
representation of the meanings of the episodes (posts and headlines) is then
crawled using a random-walk-like algorithm to find the closest posts for each
headline. The ranking of the headlines takes into account the number of steps
needed to find n relevant posts for a headline, together with the density of posts
around the headline and the average similarity between each headline and its
associated posts. For each headline, the posts are ranked according to their
similarity with the headline. Let us describe these steps in more detail.

4.1  The  Blog08  data  pre-processing 

The Blog08 collection was built by crawling the blogosphere for more than a year.
The data were provided as-is, without any cleaning, and the content of the blog
posts was stored in a pseudo-XML format¹ which is unfortunately not very well
suited to storing blog data. The first, tedious but necessary, step was to split
the permalink files and organize them by posting date (instead of crawling date).

We also took the opportunity during this step to strip the posts of the parts that
we consider useless, such as CSS and JavaScript, and to extract some general
metadata about the posts and some structural information about the blogosphere,
such as the inter-comment network.
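The cleaning step can be pictured with the following minimal sketch, which keeps
the visible text of a post and drops the content of script and style elements. It
is not the code used for BRAT: it relies only on Python's standard html.parser
module, and the class name, function name and sample post are hypothetical.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_post(html):
    """Return the visible text of a post as a single space-separated string."""
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample post.
sample = ('<html><head><style>p {color: red;}</style>'
          '<script>var x = "<b>";</script></head>'
          '<body><p>Hello <b>world</b></p></body></html>')
text = clean_post(sample)
```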

The last step of the data preparation was to detect the language of the blogs, in
order to keep only English blogs. We used a language categorization library² that
implements the algorithm described in [Cavnar and Trenkle, 1994] to categorize
texts using n-grams.
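To give an idea of how n-gram based categorization works, here is a minimal sketch
in the spirit of [Cavnar and Trenkle, 1994]: each language is represented by a
rank-ordered profile of its most frequent character n-grams, and a text is
assigned to the language whose profile minimizes an "out-of-place" rank distance.
This is not the library used for BRAT; the function names, the profile size, the
n-gram lengths and the two tiny training texts are assumptions made for
illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Map the `top_k` most frequent character n-grams of `text` to their rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    receive a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def detect_language(text, profiles):
    """Return the language whose profile is closest to the profile of `text`."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


# Hypothetical, deliberately tiny training texts; real profiles are built
# from much larger samples of each language.
english_text = ("the quick brown fox jumps over the lazy dog "
                "and the cat sleeps on the table in the kitchen")
french_text = ("le renard brun saute par dessus le chien paresseux "
               "et le chat dort sur la table dans la cuisine")
profiles = {"en": ngram_profile(english_text), "fr": ngram_profile(french_text)}
```

With such profiles, only the posts for which detect_language returns "en" would be
kept.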

4.2  Semantic  space  construction 

The semantic space method we use in the context of the Blog Track'09 is Random
Indexing (RI), which is not a typical member of the family of semantic space
methods. The particularities of RI are that (i) it does not create a co-occurrence
matrix (although this is possible if needed) and (ii) it does not need heavy
statistical processing such as the SVD used by LSA. Contrary to the other Word
Vector models, RI is based on random projection, a method that approximates
co-occurrence statistics and scales to a huge number of documents. The
construction of a semantic space with RI is as follows:

•  Create a matrix A (d x N) containing index vectors, where d is the number of
documents or contexts and N is the number of dimensions (N > 1000), chosen by the
experimenter. Index vectors are sparse and randomly generated. They consist of a
small number of +1 and -1 entries and thousands of 0s.

•  Create a matrix B (t x N) containing term vectors, where t is the number of
distinct terms in the corpus. Initialize all term vectors to zero before starting
the construction of the semantic space.

•  Scan each document of the corpus. Each time a term τ appears in a document δ,
add the randomly generated δ-index vector to the τ-term vector.

¹ The files of the Blog08 collection are not in a well-formed XML format, and the
preparation of the data was a very time- and resource-consuming task.
² http://olivo.net/software/lc4j/

At the end of the process, the term vectors of terms that appeared in similar
contexts have accumulated similar index vectors. The model includes an optional
training cycle. When all documents have been scanned, the matrix B holds all term
vectors. A new matrix A' (d' x N), with d' = d, can then be computed from the
resulting term vectors. The number of training cycles is a parameter of the model.
The training process improves the quality of the semantic space. The RI model has
been applied to the TOEFL synonymy test [Kanerva et al., 2000, Karlgren and
Sahlgren, 2001] as well as to text categorization [Sahlgren and Coster, 2004].
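The construction steps and the training cycle can be sketched as follows. This is
a minimal pure-Python illustration under stated assumptions, not the Semantic
Vectors implementation: the function names, the number of non-zero entries per
index vector (10), the dimension (1024) and the toy documents are all
hypothetical.

```python
import random


def make_index_vector(dim, nonzero, rng):
    """Sparse ternary index vector: `nonzero` entries are +1 or -1, the rest 0."""
    vector = [0] * dim
    for position in rng.sample(range(dim), nonzero):
        vector[position] = rng.choice((-1, 1))
    return vector


def random_indexing(documents, dim=1024, nonzero=10, seed=0):
    """Return (index_vectors, term_vectors), i.e. the matrices A and B."""
    rng = random.Random(seed)
    # Matrix A: one random index vector per document.
    index_vectors = [make_index_vector(dim, nonzero, rng) for _ in documents]
    # Matrix B: one term vector per distinct term, initialised to zero.
    term_vectors = {}
    for doc_id, tokens in enumerate(documents):
        for term in tokens:
            target = term_vectors.setdefault(term, [0] * dim)
            # Each occurrence adds the document's index vector to the term vector.
            for k, value in enumerate(index_vectors[doc_id]):
                if value:
                    target[k] += value
    return index_vectors, term_vectors


def retrain_document_vectors(documents, term_vectors, dim):
    """One training cycle: document vectors A' as sums of their term vectors."""
    new_vectors = []
    for tokens in documents:
        vector = [0] * dim
        for term in tokens:
            for k, value in enumerate(term_vectors[term]):
                vector[k] += value
        new_vectors.append(vector)
    return new_vectors


# Hypothetical toy documents.
docs = [["table", "chair"], ["table", "kitchen"], ["correlation", "results"]]
index_vectors, term_vectors = random_indexing(docs, dim=1024, nonzero=10, seed=0)
retrained = retrain_document_vectors(docs, term_vectors, dim=1024)
```

In this sketch a term that occurs in a single document simply inherits that
document's index vector, while a term shared by several documents accumulates the
sum of their index vectors, which is what makes terms with similar contexts end up
with similar vectors.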

For each topic (a date D), a semantic space $SS_D$ is built using the Semantic
Vectors³ library [Widdows and Ferraro, 2008]. The semantic space contains two
kinds of episodic documents: (i) all the headlines in a window⁴ [D - 1, D + 1],
(ii) all the English posts⁵ in a window [D - 1, D + 3].

4.3  A  random  walk  in  the  semantic  space 

Once the semantic space $SS_D$ of a day D has been constructed, we use a
random-walk-like algorithm to navigate the space in order to retrieve n related
blog posts for each headline.

We call the prototype of a set of documents (blog posts or headlines) the
pseudo-document represented in the semantic space by the sum of all the vectors in
the set. For instance, the prototype of all the headlines is a pseudo-document
$P_H$ represented by the vector:

$$P_H = \sum_{h \in H} h \qquad (1)$$

where H is the set containing all the headlines of $SS_D$.

Given a headline $h_i \in SS_D$ and $\eta \in \mathbb{N}$, we call the
$\eta$-neighbourhood of $h_i$ with respect to a prototype $P$ the set of blog
posts defined as follows:

$$\eta\text{-neighbourhood}(h_i, P) = \{\, b_j \mid d(b_j, h_i) < \ldots \,\} \qquad (2)$$

where $d(d_i, d_j)$ is the Euclidean distance in the semantic space between the
vectors $d_i$ and $d_j$.

In order to retrieve the n related blog posts for the headline $h_i$, we choose a
threshold m > n and walk randomly through the set B containing all the blog posts
of $SS_D$ until we find m candidate posts in the $\eta$-neighbourhood of $h_i$
with respect to the prototype $P_H$ of all the headlines. If we find m candidate
posts, we define the score $p_i$ of the headline $h_i$ as the number of steps we
walked in B. If the number of blog posts found is m' < m, then the score $p_i$ of
$h_i$ is defined as

$$p_i = \mathrm{card}(B) - m' \qquad (3)$$

In each set $I_i$ containing the blog posts found for $h_i$, we keep the
min(n, m') closest blog posts to $h_i$ as the related posts. The headlines are
then ranked in ascending order of $p_i$.

³ http://code.google.com/p/semanticvectors/
⁴ Except for the run ri2049rw3, where only the headlines of the day D were
considered.
⁵ We chose to consider as an episode the document containing a blog post and its
comments.
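The retrieval step for a single headline can be sketched as below. This is a
hedged illustration rather than the BRAT code. Two points are assumptions: the
neighbourhood test uses the threshold d(h_i, P) / η, chosen here only to make the
sketch runnable because the bound of equation (2) is not legible in the source,
and the random walk is implemented as a visit of B in a random order without
repetition. The fallback score reads equation (3) as card(B) - m'. All function
names and the toy vectors are hypothetical.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def prototype(vectors):
    """Prototype of a set of documents: the sum of their vectors (equation 1)."""
    return [sum(components) for components in zip(*vectors)]


def in_neighbourhood(post, headline, proto, eta):
    # ASSUMPTION: threshold d(headline, proto) / eta, not taken from the source.
    return distance(post, headline) < distance(headline, proto) / eta


def random_walk(headline, posts, proto, eta, n, m, rng):
    """Return (score, related) where `related` holds indices into `posts`."""
    order = list(range(len(posts)))
    rng.shuffle(order)  # ASSUMPTION: random visit of B without repetition
    found, steps = [], 0
    for idx in order:
        steps += 1
        if in_neighbourhood(posts[idx], headline, proto, eta):
            found.append(idx)
            if len(found) == m:
                break
    # Score: number of steps if m candidates were found, card(B) - m' otherwise.
    score = steps if len(found) == m else len(posts) - len(found)
    found.sort(key=lambda idx: distance(posts[idx], headline))
    return score, found[:n]  # the min(n, m') closest candidates


# Hypothetical two-dimensional toy space.
headlines = [[1.0, 0.0], [4.0, 0.0]]
posts = [[1.5, 0.0], [2.0, 0.0], [1.0, 1.5], [4.0, 0.0], [10.0, 10.0]]
p_h = prototype(headlines)
score, related = random_walk(headlines[0], posts, p_h, eta=2, n=2, m=3,
                             rng=random.Random(0))
short_score, short_related = random_walk(headlines[0], posts, p_h, eta=2, n=2,
                                         m=4, rng=random.Random(0))
```

Running the function for every headline and sorting the headlines by ascending
score would give the headline ranking described above.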


5  Results 

The submitted runs implement different hypotheses concerning the organisation of
knowledge in a semantic space built from the blogosphere. The runs correspond to
different values of $\eta$ used for the random walk algorithm.

The run ri2049rw3 corresponds to an application of the algorithm with $\eta = 3$
in a 2049-dimensional space.

The run ri1025rw5432 corresponds to an adaptive algorithm using the same
principle, in which the results of the random walks with $\eta = 5, 4, 3, 2$ are
combined.

The run ri1025rw5h2b uses a similar algorithm with a small modification of the
definition of the neighbourhood: the neighbourhood used is the intersection of the
5-neighbourhood with respect to $P_H$ and the 2-neighbourhood with respect to
$P_B$.

The run ri1025rw2b corresponds to an application of the algorithm with the
2-neighbourhood with respect to $P_B$.


Run ID          R-Precision > Median    % of Retrieved Relevant Headlines
ri1025rw2b      32                      24%
ri1025rw5432    32                      24%
ri1025rw5h2b    34                      24%
ri2049rw3       26                      20%

Table 1: Comparison of R-Precision and number of relevant retrieved headlines
with the median values

The results obtained are summarized in Table 1 and Figure 1. The good performance
of the adaptive run (ri1025rw5432) and the even better performance of the
double-constraint run (ri1025rw5h2b) constitute good arguments in favour of the
validity of the notions of semantic identity and semantic pollution.


6  Conclusions 

The original contribution of our work is a simple and efficient algorithm to
navigate a semantic space in which the semantics of blog posts is assumed to be
strongly polluted by a number of irrelevant posts. The principle underlying the
algorithm is to consider a representation of the "semantic identity of the day" as
the sum of all document vectors corresponding to the given day. Taking into
account that this representation might be strongly affected by a large number of
documents that are irrelevant from the perspective of the top stories of the day,
we defined a procedure that computes each document's similarity with both the
"semantic identity of the day" and each of the headlines. In addition, for each
headline, we rank a number of posts using a random walk through the semantic
space.

Figure 1: Retrieved relevant posts and R-Precision for the 4 runs with reference
values